Regressions and Identification

Data Analytics for Finance

Caspar David Peter

Rotterdam School of Management, Accounting Department

Regressions and Identification

Today’s Journey

  • Identification: What it means and why it matters
  • The identification problem: Endogeneity
    • Selection bias
    • Omitted variable bias
  • DiLLMa: seeing the problem in action
  • The gold standard: Experiments
  • Ordinary Least Squares (OLS) regression

Literature

  • Huntington-Klein (2022) (Chapters 5 & 13)
  • Verbeek (2021) (Chapter 2.1)

Regressions and Identification

Recap - Where did we leave things?

Last time: The Research Pipeline

Raw & Clean Data, Visualization

  • From messy to tidy data
  • Exploratory data analysis with focus on visualization
Today’s Menu
  • Identification & OLS regression
  • Experiments
  • DiLLMa (example)

Regressions and Identification

Learning Objectives

By the end of today, you will be able to

  • Understand the concept of identification in causal inference
  • Recognize endogeneity issues such as selection bias and omitted variable bias
  • Explain the importance of random assignment in experiments
  • Describe the basics of Ordinary Least Squares (OLS) regression
  • Interpret OLS regression output and understand when estimates are trustworthy

Hands-on Practice

In Assignment 3, you’ll apply these concepts: run OLS regressions, test assumptions, and create publication-quality tables.

Regressions and Identification

Introduction to Identification

Key concepts

Concept Meaning Example
Data‑Generating Process (DGP) The complete set of rules that determine how the data you observe are created Newton’s law of gravity
Variation The differences in a variable’s value across observations. How people’s incomes differ by hair color across individuals.
Identification The process of ensuring that the variation you exploit is causal, not a spurious alternative Chocolate consumption and Nobel laureates (this is spurious)

Identification

Why is identification important?

Identification is the bridge between:

  1. Theory / prior knowledge (what we already know about the DGP)
  2. Data (what we observe)
  3. Causal or associational conclusions that are scientifically defensible

Identification

What is identification?

Definition

  • Identification refers to the process of determining the specific variation in data that answers a research question
  • It involves isolating the part of the data that reflects the causal relationship of interest

Example

  • Estimating the effect of LLM use on exam scores
  • Need to isolate variation in LLM use that is not confounded by other factors (e.g., student ability)

Identification

What are the ingredients for identification?

Key Assumptions

Data Generating Process (DGP)

Assumes that observations are produced by underlying laws or processes

Theory and Assumptions

Used to identify which parts of the DGP explain the data and refine research designs

Limiting Explanation

Assumptions help block non-essential variations, focusing on the aspect of data that aligns with the research question.

Identification

Key Assumptions in DGP and Identification

Known & Unknown

Part of DGP is understood (helps form assumptions); part is unknown (area of exploration)

Theoretical Framework

Research relies on known aspects of DGP to interpret data and identify genuine causal effects

Identification Strategy

Techniques like controlling for variables, subgroup analyses, or hypothetical scenarios help test specific segments of DGP

Assumption-Based Progress

Continued research and empirical testing refine and validate assumptions about parts of the DGP

The Identification Problem: Endogeneity

Endogeneity

What can go wrong?

When we try to estimate causal effects, we face a fundamental challenge:

Endogeneity occurs when the explanatory variable (X) is correlated with the error term (ε) in our model.

This correlation can arise from several sources… let’s focus on two of the most common.

Endogeneity

Source 1: Selection Bias

Definition

Selection bias occurs when individuals self-select into treatment based on characteristics that also affect the outcome.

The problem

  • People who choose to do X are systematically different from those who don’t
  • These differences—not X itself—may explain the outcome

Classic examples

  • Do hospitals make people healthier? (Sick people go to hospitals)
  • Does college increase earnings? (High-ability students attend college)
  • Do LLMs improve exam scores? (Tech-savvy students use LLMs)

Endogeneity

Source 2: Omitted Variable Bias (OVB)

Definition

Omitted variable bias occurs when a relevant variable that affects both the explanatory variable (X) and the outcome (Y) is left out of the analysis.

The problem

  • The omitted variable “confounds” the relationship
  • We attribute its effect to X, biasing our estimate

Formal condition for OVB

A variable Z causes OVB if:

  1. Z affects Y (Z → Y)
  2. Z is correlated with X (Z ↔ X)
  3. Z is not included in the model

The connection

Selection bias is one cause of OVB. When people self-select based on characteristic Z (which also affects Y), and we omit Z from our model, we have OVB.
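The three conditions above are easy to see in a small simulation. The sketch below (all coefficients hypothetical) builds a DGP where a confounder Z drives both X and Y; omitting Z does not just shrink the estimate, it flips its sign:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical DGP: Z (think: ability) drives both X (treatment) and Y (outcome)
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)              # condition 2: Z correlated with X
y = -0.5 * x + 2.0 * z + rng.normal(size=n)   # condition 1: Z affects Y; true effect of X is -0.5

# Naive slope of Y on X (condition 3: Z omitted from the model)
naive = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Coefficient on X once Z is included in the regression
X = np.column_stack([np.ones(n), x, z])
controlled = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(naive > 0)                      # the biased estimate even has the wrong sign
print(abs(controlled + 0.5) < 0.1)    # including Z recovers roughly the true -0.5
```

Both checks print True: the naive slope is pushed well above zero by the confounder, while the controlled regression recovers the negative true effect.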

Endogeneity

Visualizing the Problem

Key insight

If ability affects both LLM use and exam scores, any naive comparison of LLM users vs. non-users will be confounded.

Endogeneity

Why does this matter?

The consequence

Endogeneity can:

  • Inflate the estimated effect (make X look better than it is)
  • Deflate the estimated effect (make X look worse than it is)
  • Flip the sign of the estimated effect (make X look beneficial when it’s harmful, or vice versa)

The challenge

We often cannot directly observe the confounding variables!

Let’s see this problem in action with a concrete example…

Large Language Models (LLMs) & Exam Scores (DiLLMa)

Disclaimer

I created this example (incl. the data) for educational purposes only. It does not represent real individuals or actual events.

“DiLLMa”

What’s the story?

Motivation

  • LLMs are increasingly used by students, e.g., for exam preparation
  • There is a debate about whether LLM use improves or hinders learning outcomes
  • Understanding the causal effect of LLM use on exam scores is crucial for educators and policymakers

Identification challenge

  • We have observational data on students’ LLM use and their exam scores
  • We need to identify the causal effect of LLM use on exam scores, accounting for potential confounders like student ability, study habits, and demographics

Research question

What is the effect of LLM use on students’ exam scores?

Libby Boxes

Let’s go!

Identification

What could possibly go wrong?

Naive comparison

What do we find?

  • Students who used LLMs scored higher on average
  • Naive comparison suggests a positive effect of LLM use on exam scores
  • Economic effect: ~0.5 point increase in exam scores

But…

Is this effect causal? Or are there confounding factors at play?

DiLLMa

Step 2: Looking at Observable Characteristics

Variables in the dataset

Variable LLM No LLM Difference t-stat. p-value
Exam score 7.04 6.55 0.50 -5.99 0.00
Attendance rate 74.15 75.43 -1.28 1.42 0.15
Female 0.49 0.50 -0.01 0.39 0.70
Age 21.37 21.34 0.03 -0.18 0.86
Study hours 20.09 20.34 -0.25 0.35 0.73

“Eye test”

  • The LLM group has higher average exam scores
  • The groups are comparable on observable characteristics, on average
  • However, there may be unobservable factors (e.g., innate ability, motivation) that differ between the groups and affect exam scores

What unobservable factors could play a role?

Do you have any ideas? What could be part of the data generating process that we do not observe in the data?

Think about it for a moment…

DiLLMa

Step 3: What about ability?

What happens if we could observe and condition on (or control for) ability?

DiLLMa

Step 4: The Reveal

Controlling for ability

What do we find?

  • Students who used LLMs scored lower on average
  • Naive comparison suggested a positive effect → now it’s negative
  • Economic effect: ~0.5 point decrease in exam scores

Do unobservable factors play a role?

Let’s dissect the results

Conditional means: -0.31 to -0.26

Group LLM No LLM Difference
Low ability 5.83 6.13 -0.31
High ability 7.45 7.72 -0.26

Group composition

Group LLM No LLM Difference
Low ability 122 379 -257
High ability 364 135 229

Key insight

  • Within high-ability students: LLM users do worse
  • Within low-ability students: LLM users do worse

Do unobservable factors play a role?

Let’s dissect the results

Unconditional means

\[ \text{LLM users: }\frac{122 \times 5.83 + 364 \times 7.45}{486} \approx 7.04 \] \[ \text{Non-users: }\frac{379 \times 6.13 + 135 \times 7.72}{514} \approx 6.55 \]

Naive effect: +0.49 (positive!)

LLMs hurt everyone — we were just fooled by who chose to use them.
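The reconciliation above is just a weighted average; a few lines of Python, using the counts and conditional means from the tables, reproduce the unconditional means exactly:

```python
# Counts and conditional means from the slides: (low ability, high ability)
counts = {"LLM": (122, 364), "No LLM": (379, 135)}
means  = {"LLM": (5.83, 7.45), "No LLM": (6.13, 7.72)}

def unconditional_mean(group):
    """Pooled mean = ability-group means weighted by group sizes."""
    (n_lo, n_hi), (m_lo, m_hi) = counts[group], means[group]
    return (n_lo * m_lo + n_hi * m_hi) / (n_lo + n_hi)

print(round(unconditional_mean("LLM"), 2))     # 7.04
print(round(unconditional_mean("No LLM"), 2))  # 6.55
```

The composition does all the work: LLM users are mostly high-ability (364 of 486), so their pooled mean is dragged up even though they do worse within each ability group.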

What did we just discover?

Recap of findings

Results summary

  1. Naive test: positive effect → “LLMs improve grades!”
  2. Add ability control: effect flips negative → “Wait, no they don’t”
  3. Conditional means plot: the reveal—LLMs hurt everyone, we were just fooled by who chooses to use them

Key insight

This is Selection Bias

The groups differed systematically on ability. High-ability students selected into LLM use — and they would have done well anyway.

Connecting the Dots: This is Endogeneity!

DiLLMa

What we just witnessed

The pattern

Analysis LLM Coefficient Interpretation
Naive comparison Positive LLMs help!
Control for ability Negative LLMs hurt!

The explanation

This is a textbook case of selection bias manifesting as omitted variable bias:

  1. Selection bias: High-ability students self-select into LLM use
  2. OVB: When ability is omitted, its effect is attributed to LLM use
  3. Result: The naive estimate is biased upward

Key takeaway

Naive comparisons can be misleading due to endogeneity issues like selection bias and OVB!

DiLLMa

The Sign-Flip Phenomenon

Critical insight

Omitted variable bias doesn’t just make estimates imprecise—it can completely flip the sign of your estimate!

In our example:

  • True effect of LLM use: Negative (LLMs hurt learning)
  • Naive estimate: Positive (because high-ability students use LLMs)
  • Bias: Positive (ability is positively correlated with both LLM use and exam scores)

The formula (intuition):

\[\text{Naive estimate} = \text{True effect} + \text{Bias}\] \[\text{Positive} = \text{Negative} + \text{(Large) Positive}\]
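The intuition above has a standard closed form, the omitted variable bias formula. With true model \(Y = \beta_0 + \beta_1 X + \beta_Z Z + \varepsilon\) and \(Z\) omitted from the regression:

```latex
\operatorname{plim}\hat{\beta}_1^{\text{naive}}
  = \underbrace{\beta_1}_{\text{True effect}}
  + \underbrace{\beta_Z \,\frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)}}_{\text{Bias}}
```

In DiLLMa both factors of the bias term are positive: ability raises scores (\(\beta_Z > 0\)) and ability is positively correlated with LLM use, so the naive estimate is pushed upward past zero.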

What can we do about it?

The gold standard: Experiments

Why is randomization so important?

  • Controlling variation in the causal variable, e.g. LLM use via random assignment

  • Makes sure that the treatment and control group are similar along observable and unobservable dimensions

  • The only difference between the two groups is the treatment

  • This allows us to attribute any difference in outcomes to the treatment

  • No selection bias (endogeneity issue)

The gold standard: Experiments

Designing and analyzing experiments

Types of experiments:

  • Field experiments
    • aim to be as similar as possible to real-world decision situations
  • A/B testing
    • aim to evaluate different versions of the same product
  • Lab experiments
    • are carried out in an artificial environment, usually a computer lab

The gold standard: Experiments

Setup of an experiment

  • Random assignment of treatment via assignment rule
  • Number of subjects? Ideally large, exact number via power analysis
    • How many subjects do I need to detect a certain effect size?
  • Proportion of treated subjects? Ideally 50/50
  • Covariate balance
    • Did random assignment work?
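The power-analysis step in the list above can be sketched with the usual normal-approximation formula for a two-sample t-test; the function name and default values below are illustrative, not a fixed convention:

```python
import math
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided two-sample t-test
    (normal approximation); effect_size is Cohen's d."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_power = norm.ppf(power)           # quantile corresponding to desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

print(n_per_group(0.5))  # medium effect: about 63 per group
print(n_per_group(0.2))  # a small effect needs far more subjects
```

The rule of thumb is visible immediately: halving the detectable effect size roughly quadruples the required sample.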

The gold standard: Experiments

DiLLMa experiment

  • Randomly assign students to LLM use or no LLM use
  • Ensure groups are balanced on observable characteristics (e.g., prior GPA, study habits)
  • Measure exam scores after the treatment period
  • Analyze the difference in exam scores between the treatment and control groups

The gold standard: Experiments

DiLLMa experiment

Random assignment of LLM use

Naive comparison

What do we find?

  • Students who used LLMs scored lower on average
  • Naive comparison suggests a negative effect of LLM use on exam scores
  • Economic effect: ~0.8 point decrease in exam scores

But…

Did random assignment work? Are the groups balanced on observable & unobservable characteristics?

The gold standard: Experiments

DiLLMa experiment

Covariate balance check

Variable LLM No LLM Difference t-stat. p-value
Exam score 6.43 7.22 -0.80 8.75 0.00
Attendance rate 74.62 75.01 -0.39 0.43 0.66
Female 0.49 0.49 0.00 -0.02 0.98
Age 21.29 21.43 -0.13 0.91 0.36
Study hours 19.92 20.56 -0.64 0.89 0.37
Ability 0.01 0.03 -0.02 0.33 0.74
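A balance table like the one above is just a series of two-sample t-tests, one per covariate. A minimal sketch on simulated data (variable names and distributions hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000

# Under random assignment, covariates are independent of treatment by construction
treat = rng.random(n) < 0.5
covariates = {
    "ability":     rng.normal(0, 1, n),
    "study_hours": rng.normal(20, 5, n),
}

for name, values in covariates.items():
    t, p = stats.ttest_ind(values[treat], values[~treat])
    print(f"{name}: t = {t:.2f}, p = {p:.2f}")  # p-values are typically large under randomization
```

Note the logic runs in reverse here: for the outcome we hope to reject the null, but for covariates a large p-value is good news, since it is consistent with successful randomization.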

The gold standard: Experiments

Takeaways

  • Random assignment ensures that the effect is not due to selection bias
  • In the experiment you control
    • who receives the treatment (the treatment group),…
    • and who doesn’t (the control group)
  • You can control other aspects of the situation to avoid other confounders
  • You can measure the effect of the treatment on the outcome
  • Potential trade-off between internal and external validity

What to do if controlling random assignment is not possible?

Reality check

The reality of finance research

  • Most financial phenomena cannot be experimentally manipulated
  • We rely on observational data
  • Need tools to make the best of imperfect situations

Two approaches

OLS with controls (today)
  • Add observable confounders as control variables
  • Limited: can’t control for unobservables
Quasi-experimental methods (Lecture 4)
  • Exploit natural variation
  • Difference-in-Differences, etc.

Ordinary Least Squares (OLS)

The Skinny

What is OLS regression?

  • OLS is the workhorse of explanatory analytics
  • It lets us answer “What would happen to the expected value of \(Y\) if we changed \(X\) by one unit?”

Prediction vs explanation

  • Prediction: use \(X\) to forecast \(Y\)
  • Statistical explanation: describe the relationship between them via a functional form such as \[ Y = \beta_0 + \beta_1 X + \varepsilon. \]
  • The coefficient \(\beta_1\) is the slope — a one‑unit change in \(X\) is associated with a \(\beta_1\) change in \(Y\).

Ordinary Least Squares (OLS)

The Skinny

Choosing a shape for the relationship

Linear: \[ Y = \beta_0 + \beta_1 X. \]

“Linear‑in‑parameters” (e.g., quadratic): \[ Y = \beta_0 + \beta_1 X + \beta_2 X^2. \]

Adding variables as controls (multi‑predictor/multivariate OLS)

In \[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon, \] the coefficient on \(X\) is estimated using only the variation in \(X\) that remains after regressing \(Z\) out of both \(X\) and \(Y\); that is, you “control for” \(Z\).
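This “regressing Z out” description is the Frisch-Waugh-Lovell theorem, and it can be verified numerically; the sketch below uses simulated data with illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = 2.0 * x + 3.0 * z + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
Z = np.column_stack([ones, z])

# (1) Multivariate regression: Y on constant, X, Z; take the coefficient on X
beta_full = ols(np.column_stack([ones, x, z]), y)[1]

# (2) FWL: residualize X and Y on Z, then regress residuals on residuals
x_resid = x - Z @ ols(Z, x)
y_resid = y - Z @ ols(Z, y)
beta_fwl = ols(x_resid.reshape(-1, 1), y_resid)[0]

print(np.isclose(beta_full, beta_fwl))  # True: the two coefficients coincide exactly
```

The equivalence is exact, not approximate, which is why “controlling for Z” and “using only the Z-free variation in X” are two descriptions of the same computation.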

Ordinary Least Squares (OLS)

The Skinny

When does it work perfectly (OLS Assumptions)?

Assumption What it requires Role
Linearity \(E[Y \mid X] = \beta_0 + \beta_1 X\) Correct specification of functional form
Exogeneity \(E[\varepsilon \mid X] = 0\) Ensures \(\hat{\beta}\) is unbiased and consistent
Homoscedasticity \(Var(\varepsilon \mid X) = \sigma^2\) OLS is efficient; the usual standard-error formulas are valid
Independence Observations are i.i.d. Standard error formulas are valid
Normality (small samples) \(\varepsilon \sim N(0, \sigma^2)\) Validates t-tests and confidence intervals

Ordinary Least Squares (OLS)

The Skinny - For those who prefer plain English

When does it work perfectly (OLS Assumptions)?

Assumption Plain English
Linearity The relationship between X and Y is a straight line
Exogeneity X is not correlated with anything else that affects Y
Homoscedasticity The spread of errors is the same across all values of X
Independence One observation doesn’t influence another
Normality (small samples) Errors follow a bell curve—only matters for hypothesis testing with few observations

Ordinary Least Squares (OLS)

The Skinny

What if assumptions are violated?

Assumption If violated How to check
Linearity Estimates are biased; wrong model Plot residuals vs fitted values
Exogeneity Estimates are biased and inconsistent Theory and DAGs; no direct test
Homoscedasticity Standard errors are wrong (often too small) Plot residuals vs fitted; Breusch-Pagan test
Independence Standard errors are wrong Durbin-Watson test; check data structure
Normality t-tests and CIs invalid in small samples Q-Q plot of residuals
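The Breusch-Pagan test in the table can be computed by hand: regress the squared residuals on the predictors and compare \(n \times R^2\) to a chi-squared critical value. A sketch on deliberately heteroscedastic simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
x = rng.uniform(1, 5, n)
y = 1 + 2 * x + rng.normal(0, x)   # error spread grows with x: heteroscedastic by design

# Fit the main regression and collect residuals
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Breusch-Pagan LM statistic: n * R^2 from regressing resid^2 on the predictors
u2 = resid ** 2
g = np.linalg.lstsq(X, u2, rcond=None)[0]
ss_res = np.sum((u2 - X @ g) ** 2)
ss_tot = np.sum((u2 - u2.mean()) ** 2)
lm = n * (1 - ss_res / ss_tot)

print(lm > 3.84)  # exceeds the 5% chi-squared(1) critical value: heteroscedasticity flagged
```

Statsmodels ships this as a ready-made test (`het_breuschpagan`); the point of the by-hand version is that the test is nothing more than an auxiliary regression.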

Ordinary Least Squares (OLS)

The Skinny

What else could go wrong?

Issue If present How to check
Multicollinearity Estimates unbiased but imprecise; unstable coefficients Correlation matrix; VIF > 10
Outliers Single observations can distort estimates Summary stats; plots
Measurement error in X Coefficient biased toward zero (attenuation bias) Theory; compare multiple measures if available
Small sample size Estimates imprecise; normality assumption matters more Check n relative to number of predictors
Missing data Bias if not missing completely at random Check patterns; compare complete vs incomplete cases
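The VIF > 10 rule of thumb in the table comes from regressing each predictor on the others: \(VIF_j = 1/(1 - R_j^2)\). A sketch with one deliberately near-collinear predictor (simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.02 * rng.normal(size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)                       # independent predictor

def vif(target, others):
    """VIF = 1 / (1 - R^2) from regressing `target` on the other predictors."""
    X = np.column_stack([np.ones(len(target))] + others)
    b = np.linalg.lstsq(X, target, rcond=None)[0]
    r2 = 1 - np.sum((target - X @ b) ** 2) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

print(vif(x2, [x1, x3]) > 10)  # the collinear predictor is flagged
print(vif(x3, [x1, x2]) > 10)  # the independent predictor is not
```

This also illustrates why multicollinearity leaves estimates unbiased but imprecise: the model cannot tell x1 and x2 apart, so their individual coefficients wobble even though their joint fit is fine.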

Ordinary Least Squares (OLS)

What does it look like?

Plot

Table output

Ordinary Least Squares (OLS)

What does the output tell me?

Component Meaning Typical notation
Rows (Coefficient, Standard Error) Estimate of \(\beta_j\) and its SE (or t‑stat) -0.021 (0.004)
Significance stars Indicates p‑value thresholds *** = p < .01
\(R^2\) Proportion of variance in \(Y\) explained by the model 0.065
Adjusted \(R^2\) Corrects \(R^2\) for number of predictors 0.065
F‑statistic Joint test that all non‑constant coefficients equal zero 35.016
RMSE Standard deviation of residuals (or “root mean squared error”) 1.307

Ordinary Least Squares (OLS)

How to interpret the output?

Table output

Interpretation

Precision: Given SE=0.004, even small differences in coefficients are detectable

Significance: *** indicates the coefficient differs from zero at the 1% level

Model fit: \(R^2\) of 0.065 tells us only ~6.5% of the variation in scores is captured by the predictors

t-stat: The slope is about five standard errors from zero (0.021/0.004 ≈ 5.3), indicating strong evidence against the null hypothesis of no effect

Typically, we focus on the coefficient estimates, their precision (SEs), and significance levels to interpret OLS results.

Ordinary Least Squares (OLS)

Takeaways

  • The error term separates what is explained by the predictors from everything else
  • Exogeneity is the linchpin that turns OLS estimates into unbiased causal indicators
  • Sampling variability is quantified by standard errors, which underpin hypothesis tests, p‑values, and confidence intervals
  • Regression tables encapsulate all of this: coefficients, precision, significance, model‑fit metrics

OLS DiLLMa: Putting It All Together

OLS DiLLMa

DiLLMa OLS results

OLS DiLLMa

DiLLMa OLS results

Prior findings recap

Reconciling OLS with comparison of means

  • The coefficient on llm in the naive OLS regression is equivalent to the difference in means between LLM users and non-users
  • Adding controls adjusts for confounding variables, changing the estimated effect of LLM use on exam scores
  • Including ability further refines the estimate by accounting for this key confounder and gives us the “true” effect of LLM use
  • Random assignment eliminates confounding, so the naive estimate from the randomized data reflects the causal effect of LLM use on exam scores
  • In an OLS setting we refer to the selection bias issue as omitted variable bias (OVB)
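The first bullet above is an exact algebraic identity, not an approximation; it is easy to confirm on simulated data (variable names and coefficients hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
llm = rng.integers(0, 2, size=n)                  # 0/1 indicator of LLM use
score = 6.5 + 0.5 * llm + rng.normal(0, 1, n)     # hypothetical DGP

# OLS of score on a constant and the dummy; take the slope
slope = np.linalg.lstsq(np.column_stack([np.ones(n), llm]), score, rcond=None)[0][1]

diff_in_means = score[llm == 1].mean() - score[llm == 0].mean()
print(np.isclose(slope, diff_in_means))  # True: dummy coefficient = difference in means
```

This is why a naive regression and a naive comparison of means can never disagree: adding controls is what changes the answer.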

OLS DiLLMa

The Limits of “Controlling For”

Important caveat

In DiLLMa, we could observe ability and control for it. In real research, we usually cannot observe all confounders.

The problem with observational data:

What we did in DiLLMa What happens in reality
Observed ability Ability is unobserved
Controlled for it in OLS Can’t control for what we don’t see
Got the “true” effect Estimate remains biased

The solution?

  • Experiments (when possible)
  • Quasi-experimental methods (when experiments aren’t possible) → Lecture 4

Conclusion

Key takeaway

Conclusion

What we learned today

Identification

  • Identification = isolating causal variation
  • Endogeneity = X correlated with error term
  • Two main sources: selection bias & OVB
  • Can flip the sign of estimates!

Tools

  • Experiments: Gold standard, but often infeasible
  • OLS with controls: Works if you observe all confounders
  • Next lecture: What if you can’t experiment AND can’t observe all confounders?

Conclusion

Looking Ahead

Coming up: Lecture 4 - Panel Data Methods

When we can’t run experiments AND controlling for observables isn’t enough, we need quasi-experimental methods:

  • Difference-in-Differences (DiD)
  • Exploits variation across groups AND time
  • Can identify causal effects without experiments

We’ll continue the DiLLMa story with panel data!

Thank You for Your Attention!

See You in the Next One!

References

Békés, Gábor, and Gábor Kézdi. 2021. Data Analysis for Business, Economics, and Policy. Cambridge University Press.
Huntington-Klein, Nick. 2022. The Effect: An Introduction to Research Design and Causality. 2nd ed. Chapman; Hall/CRC.
Verbeek, Marno. 2021. Panel Methods for Finance: A Guide to Panel Data Econometrics for Financial Applications. De Gruyter.

Appendix

Libby Boxes

DiLLMa

Back

The gold standard: Experiments

DiLLMa experiment

Covariate balance check - no random assignment!

Variable LLM No LLM Difference t-stat. p-value
Exam score 7.04 6.55 0.50 -5.99 0.00
Attendance rate 74.15 75.43 -1.28 1.42 0.15
Female 0.49 0.50 -0.01 0.39 0.70
Age 21.37 21.34 0.03 -0.18 0.86
Study hours 20.09 20.34 -0.25 0.35 0.73
Ability 0.56 -0.50 1.07 -20.11 0.00

Theory, hypotheses, and operationalisation

What is a hypothesis?

  • A hypothesis is a proposed explanation for a phenomenon or a prediction of a possible causal correlation
  • It is a tentative answer to a research question that guides the direction of study and investigation
  • Its formulation is guided by existing (economic) theory
  • A well-constructed hypothesis is testable, meaning it can be supported or rejected through experimentation

Theory, hypotheses, and operationalisation

How do we test a hypothesis?

  • We need a baseline to test against: The null hypothesis (\(H_0\))
  • Why null? It states that there is no effect or no difference, which serves as a baseline assumption
    • \(H_0\) is what you test your \(H_1\) against, e.g. \(H_0 = 0\)
    • \(H_1\) is the/your alternative hypothesis - what you might believe to be true
  • A statistical test rejects \(H_0\) if there is enough evidence against it
    • If we reject \(H_0\), we find support for \(H_1\)

The t-test measures how far the sample mean is from the value specified by \(H_0\), in standard-error units

Hypotheses:

  • One-sample (\(H_0: \mu \leq 0\), \(H_1: \mu > 0\)): \(t = \frac{\bar{X}-0}{SE(\bar{X})}\)
  • Two-sample (\(H_0: \mu_{treated} = \mu_{control}\), \(H_1: \mu_{treated} \neq \mu_{control}\)): \(t = \frac{\bar{X}_{treated}-\bar{X}_{control}}{SE(\bar{X}_{treated}-\bar{X}_{control})}\)

The one-sided vs. two-sided choice concerns the alternative hypothesis and the rejection region; the form of the t-statistic is the same either way.
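The two-sample statistic above can be computed by hand and checked against `scipy.stats.ttest_ind`; the scores below are simulated, and with equal group sizes the pooled and unpooled standard errors coincide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treated = rng.normal(6.5, 1.0, 500)   # hypothetical exam scores, LLM group
control = rng.normal(7.0, 1.0, 500)   # hypothetical exam scores, no-LLM group

# By hand: difference in means over its standard error
se = np.sqrt(treated.var(ddof=1) / 500 + control.var(ddof=1) / 500)
t_manual = (treated.mean() - control.mean()) / se

t_scipy, p = stats.ttest_ind(treated, control)
print(np.isclose(t_manual, t_scipy))  # True (equal n, so pooled SE = unpooled SE)
print(p < 0.05)                       # a 0.5-point gap with n = 500 per arm is clearly significant
```

With unequal group sizes or variances, `equal_var=False` switches `ttest_ind` to Welch's test, which matches the by-hand formula above.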